UDACITY Data Analyst Nanodegree: Project 4

Exploratory Data Analysis – Red Wine Quality

by Patrick Ferry

June 2018

Goal

I used to own an international chocolate brand. At times we marketed the product by pairing it with wine. Expert knowledge of wine, in addition to chocolate, would have helped when pairing the products to refine the tasting experience for the customer. I anticipate that after this project I’ll have a better understanding of which features most affect the quality of the wine.

Objectives

In this project, I will use the R programming language and apply exploratory data analysis techniques to:

  • Understand the distribution of a variable and check for anomalies and outliers
  • Learn how to quantify and visualize individual variables within a data set using appropriate plots such as scatter plots, histograms, bar charts, and box plots
  • Explore variables to identify the most important variables and relationships within a data set before building predictive models; calculate correlations, and investigate conditional means
  • Utilize powerful methods and visualizations for examining relationships among multiple variables, such as reshaping data frames and using aesthetics like color and shape to uncover more information

After loading packages, reading the data file (‘wineQualityReds.csv’), and tidying the data set, we are ready for our analysis. I renamed column X to ‘wine.id’ and converted quality to an ordered factor called ‘rating’ with six levels: ‘not recommended’, ‘mediocre’, ‘good’, ‘very good’, ‘outstanding’, and ‘classic’, with labels borrowed from Wine Spectator’s 100-point scale. In this data set, quality scores of 3 - 4 fall under ‘good’, 5 - 6 under ‘very good’, and 7 - 8 under ‘outstanding’.
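
The tidying step described above might look like the following minimal sketch. It is not the original code: a ten-row stand-in built from the first quality scores printed by str() in this report replaces the call to read.csv() so the block runs on its own, and the cut() breaks are an assumption chosen so that scores of 5 - 6 land in ‘very good’ and 7 - 8 in ‘outstanding’, matching the counts printed in the summary output.

```r
# Stand-in for: red <- read.csv("wineQualityReds.csv")
# Quality values are the first ten scores printed by str() in this report.
red <- data.frame(
  X = 1:10,
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Rename the row-index column X to wine.id
names(red)[names(red) == "X"] <- "wine.id"

# Bin quality into an ordered factor. The breaks are an assumption inferred
# from the printed counts, not taken from the original code.
rating_levels <- c("not recommended", "mediocre", "good",
                   "very good", "outstanding", "classic")
red$rating <- cut(red$quality,
                  breaks = c(-Inf, 1, 2, 4, 6, 8, 10),
                  labels = rating_levels,
                  ordered_result = TRUE)

table(red$rating)
```

Because cut() uses right-closed intervals by default, a score of 6 falls in (4, 6] and a score of 8 in (6, 8].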

Introduction

While over 500 chemical compounds have been identified in wine, most produced naturally during fermentation, all wines have some basic elements in common including acid & sugar.

During fermentation, sugar is turned into alcohol when the skin of a ripe grape separates & the sugary juice on the inside makes contact with yeasts living naturally in the air & on the surface of the grape skin. Yeasts voraciously eat their way through the sugar and convert it into alcohol (leftover sugar makes for a sweeter wine).

Acid, when present in the right proportion, results in an intense & refreshing wine. Acid has the added benefit of acting as a preservative, while alcohol balances other flavors.

Other factors such as variety of grapes, optimum ripeness & yield, and soil quality contribute to the perfect wine, but ultimately we’re looking for an optimum balance of sugar and acidity.

In particular, I would like to analyze these three variables (acidity, sugar, alcohol) and their effect on the quality of red wine in our data set. During the data analysis, I anticipate I may come across other variables that could also affect the quality of wine.

Our data set is limited to red variants of the Portuguese “Vinho Verde” wine, so I’m not sure how accurate modeling wine quality will be relative to red wine in general. Is our sample size large enough to draw robust conclusions about quality and make accurate predictions?

Let’s explore the data set more broadly to understand its structure.

Univariate Plots Section

Data Structure

## [1] 1599   14
## 'data.frame':    1599 obs. of  14 variables:
##  $ wine.id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ rating              : Ord.factor w/ 6 levels "not recommended"<..: 4 4 4 4 4 4 4 5 5 4 ...
##     wine.id       fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality                  rating    
##  Min.   : 8.40   Min.   :3.000   not recommended:   0  
##  1st Qu.: 9.50   1st Qu.:5.000   mediocre       :   0  
##  Median :10.20   Median :6.000   good           :  63  
##  Mean   :10.42   Mean   :5.636   very good      :1319  
##  3rd Qu.:11.10   3rd Qu.:6.000   outstanding    : 217  
##  Max.   :14.90   Max.   :8.000   classic        :   0

Observations

  • 1,599 samples of red wine
  • 11 numerical attributes that are physicochemical properties of wine, presumably contributing to its quality
  • Quality of wine is our outcome variable ranked on a scale of 1 to 10, although the range for this data set is 3 – 8
  • Citric.acid jumps off the page with a min value of 0.00
  • Several variables contain possible outliers lying more than 1.5x IQR beyond the quartiles: fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates
  • Median rank of the quality of red wine is 6.00
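
The 1.5x IQR rule mentioned in the list above can be sketched as follows. The helper name flag_outliers is hypothetical, and the ten fixed.acidity values are simply the first rows printed by str(), used here as a stand-in for the full column.

```r
# Flag values more than k * IQR beyond the first or third quartile
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# First ten fixed.acidity values from the str() output
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
fixed_acidity[flag_outliers(fixed_acidity)]

# Same rule applied to the full-data quartiles reported by summary():
# Q1 = 7.10 and Q3 = 9.20, so the upper fence is 9.20 + 1.5 * 2.10
upper_fence <- 9.20 + 1.5 * (9.20 - 7.10)
15.90 > upper_fence   # the maximum fixed.acidity lies beyond the fence
```

Applying the same arithmetic to the other summary() rows is a quick way to check which variables deserve a closer look before plotting.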

Plotting the Distributions

Let’s plot quality and rating to understand this distribution; after all, quality is our main focus in this analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## not recommended        mediocre            good       very good 
##               0               0              63            1319 
##     outstanding         classic 
##             217               0

The quality of the wines ranges from 3 to 8 with a median of 6. 82.5% (1,319) of the wines are rated “very good” under our rating, which falls right in the middle of the distribution. The data set lacks the very worst and very best quality wines, which may affect our models. It will be interesting to examine the distributions of each variable and investigate any correlations between variables.
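
As a quick check of the share quoted above, the rating counts from the table() output can be turned into percentages. The vector below is typed in by hand from that output, not read from the data file.

```r
# Rating counts copied from the table() output above (empty levels omitted)
rating_counts <- c(good = 63, "very good" = 1319, outstanding = 217)

# Share of each rating as a percentage of all 1,599 wines
rating_share <- round(100 * rating_counts / sum(rating_counts), 1)
rating_share
```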

Going forward, I will use ggplot syntax when plotting the distributions of the input variables, several of which have long right tails.

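In the second plot of each pair, I’ll apply a log10 scale, which compresses the right tail and often makes a positively skewed distribution look closer to normal, so patterns are easier to see without being distracted by the tails. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed variables, but reducing skew can still help a linear fit.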

It is preferable to use scale_x_log10() rather than wrapping the variable in log10(), because the axis is then labeled in the original units rather than in log units.
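
The two plotting approaches compared above can be sketched like this. The data frame is a ten-row stand-in built from the fixed.acidity values printed by str(), and the bin count is an arbitrary assumption rather than the setting used for the plots in this report.

```r
library(ggplot2)

# First ten fixed.acidity values from the str() output, as a stand-in
# for the full data frame
red <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Plot 1: histogram on the original scale
p_raw <- ggplot(red, aes(x = fixed.acidity)) +
  geom_histogram(bins = 10)

# Plot 2: same data with a log10-scaled x axis; tick labels stay in the
# original units
p_log <- p_raw + scale_x_log10()

# Alternative: wrap the variable itself; the axis is then labeled in
# log10 units
p_wrapped <- ggplot(red, aes(x = log10(fixed.acidity))) +
  geom_histogram(bins = 10)
```

Printing p_log and p_wrapped side by side shows the same bar shapes with differently labeled axes.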

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

fixed.acidity has a median of 7.90 and is positively skewed with some outliers greater than 15.0. A log10 scaling layer makes the distribution look closer to normal and pulls in the extreme outliers. Most of the acids involved with wine are fixed acids, with the notable exception of acetic acid (volatile.acidity). Presumably, volatile acids will have a greater effect on quality than fixed, as too high a concentration results in a vinegary taste.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

volatile.acidity is also positively skewed. Transforming the data presents a slightly bimodal distribution with peaks around 0.4 and 0.7. There are some outliers of higher acidic wines greater than 1.0, but the majority are around 0.5 in the normalized distribution. Volatile acidity is mostly caused by bacteria in the wine resulting in a vinegary taste. This is the first of three variables I suggest will affect wine quality most, so I will pay attention to its impact on quality as well as its correlation to other variables in the next section.

Let’s summarise to look closer at wine by acidity:

## # A tibble: 6 x 6
##   quality mean_acid median_acid min_acid max_acid     n
##     <int>     <dbl>       <dbl>    <dbl>    <dbl> <int>
## 1       3 0.8845000       0.845     0.44    1.580    10
## 2       4 0.6939623       0.670     0.23    1.130    53
## 3       5 0.5770411       0.580     0.18    1.330   681
## 4       6 0.4974843       0.490     0.16    1.040   638
## 5       7 0.4039196       0.370     0.12    0.915   199
## 6       8 0.4233333       0.370     0.26    0.850    18

The mean and median acidity levels decrease as wine quality rises from 3 to 7 and then level off at 8, indicating a negative correlation with wine quality. We’ll confirm this hunch in the next section.
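
A grouped summary with the same column names as the tibble above could be produced along these lines. This is a sketch using dplyr on a ten-row stand-in (the first rows printed by str()), so the numbers it returns will not match the full-data table.

```r
library(dplyr)

# First ten rows of the two relevant columns, taken from the str() output
red <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# One row per quality score with summary statistics of volatile.acidity
acid_by_quality <- red %>%
  group_by(quality) %>%
  summarise(mean_acid = mean(volatile.acidity),
            median_acid = median(volatile.acidity),
            min_acid = min(volatile.acidity),
            max_acid = max(volatile.acidity),
            n = n())

acid_by_quality
```

Swapping volatile.acidity for another column gives the equivalent table for that variable.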

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

citric.acid is an interesting distribution as it changes from long-tail right to long-tail left after transforming the data. The majority of wines have small amounts of citric.acid, with lots having none at all. In our initial observations, it is the only variable with a minimum value of 0.00; perhaps indicating missing data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

residual.sugar is very much positively skewed with extreme outliers far away from the median. Sugar content ranges widely from 0.90 to 15.50. There are a lot of wines with low sugar content between 1.0 - 3.0 (no sweet wines in our data set). Sugar is the second of three variables I predict will affect wine quality most.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

chlorides is similar to residual.sugar, having high concentrations around the median. The majority of wines have low chloride levels, roughly between 0.01 - 0.2. Transforming the data normalizes the distribution and shows more clearly that most wines have chloride values between 0.05 - 0.10.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

free.sulfur.dioxide is positively skewed with extreme outliers greater than 50. High concentrations can affect the smell and taste of wine. Transforming the data eliminates some outliers and returns a bimodal distribution, with peaks around 7.0 and 11.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

total.sulfur.dioxide is positively skewed, also with extreme outliers greater than 280. Only a few wines have concentrations around 280, while over 300 wines sit around 30. Upon transformation the distribution looks more normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

density looks to have a very normal distribution, with & without transformation, with little-to-no outliers. Density is affected by alcohol and sugar so we’ll examine further in the next section.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH also has a very normal distribution, with most wines ranging between 3.0 - 3.5, with outliers around 2.7 and 4.0. Most wines are acidic and range between 3.0 - 4.0 on the pH scale. It is possible pH has a greater than expected impact on wine quality so we’ll examine further in the next section.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

sulphates is positively skewed with a long-tail distribution. Transforming the data shows a fairly normal distribution, similar to density and pH. Most wines contain sulphate levels between 0.3 - 1.4, as our distribution shows. There are some outliers around the 2.0 level.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

alcohol is positively skewed, and the distribution keeps the same shape after transforming the data. The majority of wines have an alcohol content between 9.5 - 10.0, which is interesting as the typical content for wine in general is 11.5% - 13.5%. Alcohol is the third of three variables I suggested will have a greater effect on wine quality.

Let’s summarise across quality:

## # A tibble: 6 x 6
##   quality  mean_alc median_alc min_alc max_alc     n
##     <int>     <dbl>      <dbl>   <dbl>   <dbl> <int>
## 1       3  9.955000      9.925     8.4    11.0    10
## 2       4 10.265094     10.000     9.0    13.1    53
## 3       5  9.899706      9.700     8.5    14.9   681
## 4       6 10.629519     10.500     8.4    14.0   638
## 5       7 11.465913     11.500     9.2    14.0   199
## 6       8 12.094444     12.150     9.8    14.0    18

The mean and median alcohol content generally increase with higher rated wines, apart from a dip at quality 5, indicating a positive correlation.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 wines in the data set with 11 numeric input variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol). Quality, an integer, is our outcome variable, and rating is an ordered factor derived from it.

Most of the distributions are positively skewed with some variables having extreme outliers (residual.sugar, chlorides, sulphates, free.sulfur.dioxide, total.sulfur.dioxide). Two distributions were normal (density, pH). Citric.acid is an interesting distribution and the only variable with a minimum of 0.00. 82.5% of the wines are rated ‘very good’ with quality scores of 5 and 6.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest in the data set. How certain variables affect quality is the objective of this analysis, ultimately leading us to build an accurate predictive model.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Initially I predicted acidity, sugar and alcohol would have the greatest impact on the quality of wine. However, after plotting and analyzing the univariate distributions it appears density and pH, with their very normal distributions, may have a greater impact on quality.

Did you create any new variables from existing variables in the dataset?

I created an ordered factor of rating to compare against quality. I considered creating a new variable ‘total.acidity’, comprised of fixed and volatile acidity, but left it alone. Upon further investigation I learned volatile.acidity is the variable that will most affect the quality of wine (vinegary taste).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric.acid was an unusual distribution with a min of 0.00, perhaps due to missing data. Also, quality is unusual in that we only have wines ranked 3 - 8. Where are wines 1-2 and 9-10? Again, perhaps a limited data set. Without lowest & highest quality wines to consider will our analysis be clear cut?

For optics and potential future analysis, I did tidy the data by renaming column ‘X’ to ‘wine.id’ and creating a factored variable called ‘rating’ from quality scores.

Bivariate Plots Section

Let’s first create a correlation matrix of the variables in our data set. A correlation matrix is used to investigate the dependence between multiple variables at the same time. The resulting table contains the correlation coefficients between each variable and the others.

Correlation: Negative correlation is a relationship between two variables in which one variable increases as the other decreases, and vice versa. A perfect negative correlation is represented by the value -1.00 (x increases, y decreases or vice versa), while a 0.00 indicates no correlation, and a +1.00 indicates a perfect positive correlation (x & y increase/decrease in tandem).
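
A correlation matrix of this kind can be computed with base R’s cor(). The sketch below uses only five columns and the first ten rows printed by str(), so its coefficients are illustrative and will differ from the full-data values quoted in this report; on the full data frame, the non-numeric rating column and the wine.id index would be dropped first.

```r
# First ten rows of a few columns, taken from the str() output
red <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(red, method = "pearson"), 2)

# The row for quality shows how each variable correlates with the outcome
cor_matrix["quality", ]
```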

Observations

  • Alcohol (0.48) and volatile.acidity (-0.39) are most strongly correlated with the quality of wine
  • There is a meaningful but smaller correlation with the quality of wine in sulphates (0.25) and citric.acid (0.23)
  • Residual.sugar (0.01) most clearly illustrates zero correlation with the quality of wine
  • pH (-0.06), free.sulfur.dioxide (-0.05), total.sulfur.dioxide (-0.19), fixed.acidity (0.12), density (-0.17) and chlorides (-0.13) indicate little to no correlation

For the following analysis, let’s utilize boxplots to analyze each variable more closely against the above correlation matrix.
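
Each block of output in this section has the same shape: a Pearson test on the raw variable, the same test on its log10 transform, and a per-quality summary. A sketch of how one such block and its boxplot might be generated is shown below for volatile.acidity, again on a ten-row stand-in taken from the str() output; the plot styling is an assumption.

```r
library(ggplot2)

# First ten rows of the two relevant columns, taken from the str() output
red <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Pearson correlation test on the raw and the log10-transformed variable
ct_raw <- cor.test(red$quality, red$volatile.acidity)
ct_log <- cor.test(red$quality, log10(red$volatile.acidity))

# Min, quartiles, median and mean of the variable for each quality score
by(red$volatile.acidity, red$quality, summary)

# One box per quality score; factor() makes quality a discrete x axis
p_box <- ggplot(red, aes(x = factor(quality), y = volatile.acidity)) +
  geom_boxplot() +
  labs(x = "quality", y = "volatile.acidity")
```

The degrees of freedom reported by cor.test() are n - 2, which is why the full-data output shows df = 1597 for 1,599 wines.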

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$fixed.acidity)
## t = 4.5953, df = 1597, p-value = 4.661e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06558369 0.16234953
## sample estimates:
##       cor 
## 0.1142376
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.225  12.600

Wines rated 7 and 8 have somewhat higher mean & median values of fixed.acidity than lower rated wines, giving a slight positive correlation (0.12). Although the correlation is statistically significant (p < 0.001), it is weak, and the mean & median values stay similar as quality increases. Fixed.acidity doesn’t seem to have much effect on wine quality.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$volatile.acidity)
## t = -16.99, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4319851 -0.3489201
## sample estimates:
##        cor 
## -0.3912492
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

High volatile.acidity negatively impacts the quality of wine (-0.39). Both the median and mean levels decrease as wine quality increases. Outliers are greatest in lower quality wines and the interquartile range decreases as quality improves.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$citric.acid)
## t = NaN, df = 1597, p-value = NA
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  NaN NaN
## sample estimates:
## cor 
## NaN
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Most wines have small quantities of citric.acid. Higher quality wines appear to contain slightly more citric.acid, indicating a slightly positive correlation (0.23), peaking at a level of around 0.50. Interestingly, 0.00 levels of citric.acid are found at every quality level except the highest quality of 8 in our data set. Those zeros are also why the log10 correlation returns NaN: log10(0) is -Inf, so the coefficient cannot be computed.
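
The NaN in the log10 correlation output above can be reproduced on the first ten rows printed by str(). The two workarounds shown are assumptions offered for illustration, not steps taken in this analysis, and each changes what the coefficient measures.

```r
# First ten quality and citric.acid values from the str() output
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
citric  <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

log10(0)                               # -Inf

# With -Inf values present the coefficient is not a finite number
r_log <- cor(quality, log10(citric))

# Option 1 (assumption): shift before taking logs so zeros stay finite
r_shifted <- cor(quality, log10(citric + 1))

# Option 2 (assumption): use only wines with non-zero citric.acid
nonzero <- citric > 0
r_nonzero <- cor(quality[nonzero], log10(citric[nonzero]))
```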

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$residual.sugar)
## t = 0.94071, df = 1597, p-value = 0.347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02551727  0.07247084
## sample estimates:
##        cor 
## 0.02353331
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

residual.sugar has essentially no effect on the quality of wine; in fact, it is the clearest instance in our data set of no correlation at all (0.01). The mean, median & interquartile range all hover around the same values. Lower rated wines have the most outliers, perhaps due to a limited data set. The log10 plot shows much of the same.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$chlorides)
## t = -7.1508, df = 1597, p-value = 1.308e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2232336 -0.1282260
## sample estimates:
##      cor 
## -0.17614
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

Most wines have very low quantities of chlorides. There is a weak correlation between chlorides and quality. Outliers are found in lower quality wines. The log10 plot indicates lower amounts of chlorides are found in higher quality wine. The interquartile range is smallest for 8-quality wines. The correlation improves by 0.05 with a log10 transformation.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$free.sulfur.dioxide)
## t = -2.0041, df = 1597, p-value = 0.04522
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.098865884 -0.001068979
## sample estimates:
##         cor 
## -0.05008749
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00

This is an interesting plot. The correlation is close to 0.00. Wines of lower and higher quality have less free.sulfur.dioxide and similar means and medians. Average wines have slightly higher levels of free.sulfur.dioxide. The log10 plot tells the same story.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$total.sulfur.dioxide)
## t = -6.8999, df = 1597, p-value = 7.476e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2173510 -0.1221403
## sample estimates:
##        cor 
## -0.1701427
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

There is little to no correlation between total.sulfur.dioxide and wine quality, as means, medians, and interquartile ranges of the best and worst wines are similar. There are extreme outliers in 7-quality wines interestingly.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$density)
## t = -7.1103, df = 1597, p-value = 1.74e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2222860 -0.1272452
## sample estimates:
##        cor 
## -0.1751737
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0032 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

Higher quality wines appear to have lower density, and the best wines (quality 8) have the lowest median. However, the upper quartile of the best wines (0.9972) falls within the range of the other medians. Given the plot and the correlation (r = -0.17), the relationship is not very strong. It would be interesting to look at the relationship between density and alcohol, as higher alcohol content may affect both density and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$pH)
## t = -2.3046, df = 1597, p-value = 0.02132
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106294995 -0.008576923
## sample estimates:
##         cor 
## -0.05757386
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.163   3.230   3.267   3.350   3.720

The relationship between pH and quality looks somewhat stronger than the trend line or correlation suggests. The mean and median of pH generally decrease as wine quality increases, with a slight uptick between quality 5 and 6. The dispersion of data points in better wines is still quite large. The correlation between pH and quality is close to zero (r = -0.06), but in relation to other variables (SO2, acidity) we may gain some meaningful insights.
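The trend in the conditional means can be checked directly against the printed summary. The sketch below copies the six pH means from the output above and tests whether they fall at every step; the variable names are illustrative only.

```r
# Mean pH per quality score, copied from the summary output above.
quality <- 3:8
mean_ph <- c(3.398, 3.382, 3.305, 3.318, 3.291, 3.267)

# Change in the mean each time quality rises by one point.
steps <- diff(mean_ph)
names(steps) <- paste(quality[-length(quality)], quality[-1], sep = "->")
print(steps)

# TRUE only if the mean falls at every single step.
strictly_decreasing <- all(steps < 0)
print(strictly_decreasing)

# Overall change from the lowest-rated to the highest-rated wines.
total_change <- mean_ph[length(mean_ph)] - mean_ph[1]
print(total_change)
```

The means fall by about 0.13 pH units overall, but not at every step: the mean rises slightly between quality 5 and 6.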

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$sulphates)
## t = 12.967, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2636092 0.3523323
## sample estimates:
##       cor 
## 0.3086419
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

There is a relationship between better wines and more sulphates, albeit slight, and it is clearer if we set aside the extreme outliers. Better wines appear to have higher concentrations of sulphates. The interquartile range for the best quality wines is rather small. The correlation between the two variables strengthens by about 0.06 (from 0.25 to 0.31) after a log10 transformation.

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and log10(red$alcohol)
## t = 21.687, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4382062 0.5139842
## sample estimates:
##       cor 
## 0.4769811
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Higher quality wines appear to contain more alcohol. This is the clearest positive correlation in our data set, although at r = 0.48 it is moderate rather than strong. Mean alcohol content is generally greater in higher quality wines, with quality 5 as the exception (it has the lowest mean, 9.9). The trend is pronounced enough that the lower quartile for 8-quality wines (11.32) is greater than the upper quartile for 6-quality wines (11.30). With many outliers, it is possible that alcohol alone doesn’t account for better quality wine.
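The quartile comparison can be verified from the printed summaries. This sketch copies the alcohol quartiles from the output above into a small data frame; the object names are illustrative.

```r
# Alcohol quartiles per quality score, copied from the summary output above.
alcohol_q <- data.frame(
  quality = 3:8,
  q1      = c(9.725, 9.60, 9.4, 9.80, 10.80, 11.32),
  median  = c(9.925, 10.00, 9.7, 10.50, 11.50, 12.15),
  q3      = c(10.575, 11.00, 10.2, 11.30, 12.10, 12.88)
)

# Lower quartile of 8-rated wines versus upper quartile of 6-rated wines.
q1_of_8 <- alcohol_q$q1[alcohol_q$quality == 8]
q3_of_6 <- alcohol_q$q3[alcohol_q$quality == 6]
gap <- q1_of_8 - q3_of_6
print(gap > 0)

# Quality score with the lowest median alcohol.
lowest_median <- alcohol_q$quality[which.min(alcohol_q$median)]
print(lowest_median)
```

The gap between the two quartiles is positive but narrow (0.02 percentage points), and the lowest median belongs to quality 5 rather than quality 3, so the increase is not perfectly steady across scores.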

Alternative plot from revision:

library(ggplot2)

# Alcohol by quality: jittered points, boxplots, and the group mean in red
ggplot(data = red, aes(x = factor(quality), y = alcohol)) +
  geom_jitter(alpha = 1/10) +
  geom_boxplot(alpha = 1/10, color = 'blue') +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = 'point', color = 'red') +
  labs(x = 'Quality (score between 3 and 8)',
       y = 'Alcohol (% by volume)',
       title = 'Boxplot of alcohol across qualities')

Based on the bivariate analysis of plots and correlations, I now believe the following variables will best predict wine quality:

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

No single variable met the definition of a strong correlation with wine quality. Surprisingly, sugar had the least effect on wine quality (0.01), despite my initial prediction that it would be one of the three most influential variables.

Let’s zoom in and take a closer look:

Wines rated ‘very good’ (6,7 quality) span the full range of sugar content, although lesser ‘good’ (4,5 quality) wines also include some with higher content. It doesn’t appear that sugar content greatly affects wine quality, but we’ll analyze further in the next section.

Fortunately, two of the three show a moderate correlation with wine quality and could possibly be used to develop an accurate predictive model. Alcohol (0.48) and volatile.acidity (-0.39) should work well together, as they are less correlated with each other (-0.2) than either is with quality.

Taken alone, density, pH, citric.acid, and fixed.acidity have weak to zero correlation with quality; however, their relationships with certain other variables may still inform wine quality:

  • alcohol : density (-0.50)
  • density : fixed.acidity (0.67)
  • pH : fixed.acidity (-0.68)
  • pH : citric.acid (-0.54)
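Pairwise correlations like those listed above can be read off a correlation matrix. The sketch below uses simulated stand-in data, built so that pH falls and density rises with fixed.acidity. The column names mirror the wine data set, but the values and the helper `pair_cor` are hypothetical.

```r
# Simulated stand-in data; the values do NOT come from wineQualityReds.csv.
set.seed(2)
n <- 300
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
toy <- data.frame(
  fixed.acidity = fixed.acidity,
  # Built so that pH falls and density rises with fixed.acidity.
  pH      = 3.9 - 0.07 * fixed.acidity + rnorm(n, sd = 0.10),
  density = 0.9900 + 0.0008 * fixed.acidity + rnorm(n, sd = 0.0015),
  alcohol = rnorm(n, mean = 10.4, sd = 1.1)
)

# Pairwise Pearson correlations, rounded for reading.
cor_matrix <- round(cor(toy), 2)
print(cor_matrix)

# Hypothetical helper: look up one pair of variables by name.
pair_cor <- function(m, a, b) m[a, b]
print(pair_cor(cor_matrix, "pH", "fixed.acidity"))
```

On the real data, the same `cor()` call over the numeric columns of `red` would give the pairwise values quoted in the list.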

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Volatile.acidity (acetic acid), which is produced during fermentation, had a moderately negative correlation with wine quality.

Another variable that interested me was citric.acid. It has a relatively high negative correlation with both volatile.acidity and pH, and a somewhat high correlation with density. Previously, I did not have a grasp of how acidity affects wine, let alone its chemical components and their interactions with other variables.

What was the strongest relationship you found?

Relative to quality, alcohol had the strongest positive correlation (0.48).

Among the other variables, fixed.acidity and pH had the strongest negative correlation (-0.68), while fixed.acidity had the strongest positive correlations, with density and with citric.acid (both 0.67).

Multivariate Plots Section

As our analysis has evolved, so too has my understanding of what makes for better quality red wine. The variables I initially predicted would affect wine quality were acidity, sugar, and alcohol.

We will continue analyzing acids and alcohol, as they are major wine constituents and contribute greatly to its taste. Since we are working with a red wine data set, and not sweet wines, residual.sugar is less compelling and we will not consider it for our model. However, two other important variables need a closer look: pH and SO2.

pH plays a role in the stability of wine, while free.sulfur.dioxide is an effective preservative. Understanding the relationship between pH and sulfur dioxide (SO2) is critical. The higher the pH, the less SO2 will be in the useful free form and the less effective this free SO2 will be.
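The pH dependence described above can be illustrated with a relation commonly quoted in winemaking references, which estimates the molecular (active) portion of SO2 as free SO2 / (1 + 10^(pH - pKa)). This is a rough sketch only: the pKa of about 1.81, the 30 mg/L free SO2 level, and the helper name `molecular_so2` are assumptions for illustration, not values or code from this analysis.

```r
# Assumed first dissociation constant of sulfurous acid (pKa ~ 1.81);
# an illustrative value, not derived from this data set.
pka_so2 <- 1.81

# Hypothetical helper: estimated molecular SO2 (mg/L) for a given
# free SO2 level (mg/L) and pH.
molecular_so2 <- function(free_so2, ph, pka = pka_so2) {
  free_so2 / (1 + 10^(ph - pka))
}

# Same free SO2 level at three pH values within the range seen in the data.
ph_values <- c(3.0, 3.3, 3.6)
active <- molecular_so2(free_so2 = 30, ph = ph_values)
print(round(active, 2))
```

Under these assumptions, the estimated active portion shrinks as pH rises, which is the direction of the relationship described in the paragraph above.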

In this section, I’ll explore many variables at once, examine relationships among our main variables (pH, free.sulfur.dioxide, volatile.acidity, alcohol), and look for patterns predicting wine quality.

alcohol and volatile.acidity are the two variables most correlated with quality in our data set, as well as two of our main variables. The plot confirms that better quality wines have less volatile.acidity and more alcohol. Most wines with volatile.acidity > 0.60 are lower quality and have lower alcohol content (< 10.0).

density and volatile.acidity are among the variables least correlated with each other. Wine quality appears to be higher when volatile.acidity and density are both low.

Regardless of density, most of the higher quality wines contain > 11% alcohol.

Alternative plots from revision:

# factor() gives a discrete colour variable, which scale_color_brewer() requires
ggplot(data = red,
       aes(x = density,
           y = alcohol,
           color = factor(quality))) +
  coord_cartesian(xlim = c(0.985, 1.002),
                  ylim = c(7.5, 15)) +
  geom_jitter(size = 1) +
  geom_smooth(method = 'lm') +
  scale_x_continuous(breaks = seq(0.985, 1.002, 0.002)) +
  scale_color_brewer(type = 'seq', guide = guide_legend(title = 'Quality levels')) +
  theme_dark()

ggplot(data = red,
       aes(x = density, y = alcohol, color = factor(quality))) +
  coord_cartesian(xlim = c(0.985, 1.005),
                  ylim = c(5, 15)) +
  geom_jitter() +
  scale_color_brewer(type = 'seq') +
  theme_dark() +
  labs(x = 'Density (g/cm^3)',
       y = 'Alcohol (% by volume)',
       title = 'Relationship of density VS alcohol with colored quality levels')

alcohol and sulphates have a low correlation with each other, but taken together they present an interesting pattern. Wines with higher alcohol content typically had lower sulphate content than wines with lower alcohol content. Higher quality wines have more alcohol (> 10.0) and fewer sulphates (< 0.80).

sulphates and citric.acid are two of the more highly correlated variables with quality (after alcohol and volatile.acidity). In this plot, wine quality appears to improve with greater concentrations of sulphates, regardless of citric acid. The outliers tend to be lesser quality wines.

pH has a weak correlation with quality, and citric.acid has one of the higher correlations with pH. Better quality wine tends to have higher pH and more citric acid.

Most measurements are for free.sulfur.dioxide (the active form), as the majority of bound SO2 (part of total SO2) is of no use as a preservative. Low pH (more acid) requires a higher % of SO2 for quality wine.

From the analysis in this section we can see certain patterns emerge from a lot of noisy plots. Let’s examine the three variables most correlated with quality, but eliminate the mid-range (quality 5 & 6) wines and see if the visualization is clearer and/or shows us anything different.

Most high quality wines tend to contain low volatile.acidity, low sulphates, and high alcohol. There is still a lot of noise and overlap, indicating no one variable can entirely predict good wine quality.
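Removing the mid-range wines before plotting is a one-line filter. The sketch below shows one way to do it on simulated stand-in data; the `toy` data frame, the `extremes` object, and the `group` labels are hypothetical and are not taken from the original code.

```r
# Simulated stand-in data; values are NOT from wineQualityReds.csv.
set.seed(3)
toy <- data.frame(
  quality          = sample(3:8, 400, replace = TRUE,
                            prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01)),
  alcohol          = round(runif(400, 8.4, 14.9), 1),
  volatile.acidity = round(runif(400, 0.12, 1.58), 2),
  sulphates        = round(runif(400, 0.33, 2.00), 2)
)

# Drop the mid-range wines (quality 5 and 6) to reduce overplotting.
extremes <- subset(toy, !(quality %in% c(5, 6)))

# Label the two remaining groups, e.g. for use as a colour aesthetic.
extremes$group <- ifelse(extremes$quality <= 4, "low (3-4)", "high (7-8)")

print(nrow(toy) - nrow(extremes))   # number of rows removed
print(table(extremes$group))
```

On the real data, the same `subset()` call applied to `red` would leave only the wines rated 3, 4, 7, or 8.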

A linear regression model is created to predict quality based on the physicochemical properties of our red wine data set. I foresee a couple of shortcomings negatively impacting our ability to predict wine quality with a high degree of confidence:

  • the data set is biased, as the wine is limited to red variants of the Portuguese “Vinho Verde” type
  • based on our rating, there are neither ‘classic’ (10) nor ‘not recommended’ (0,1) wines (the vast majority are of average quality)

I ran seven linear regression models (lm1 to lm7), ultimately submitting a model containing the variables analyzed in the multivariate plot section. No single model was too far off from the others nor, more importantly, did any of the predictions of quality engender significant confidence. Perhaps a more complete data set would improve the model’s predictive power.

## 
## Calls:
## lm1: lm(formula = as.numeric(quality) ~ (alcohol), data = red)
## lm2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity, 
##     data = red)
## lm3: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     pH, data = red)
## lm4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     pH + free.sulfur.dioxide, data = red)
## lm5: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     pH + free.sulfur.dioxide + density, data = red)
## lm6: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     pH + free.sulfur.dioxide + density + sulphates, data = red)
## lm7: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     pH + free.sulfur.dioxide + density + sulphates + citric.acid, 
##     data = red)
## 
## =========================================================================================================================
##                            lm1           lm2           lm3           lm4           lm5           lm6           lm7       
## -------------------------------------------------------------------------------------------------------------------------
##   (Intercept)             -0.125         1.095***      2.269***      2.275***     -9.743         1.486       -13.084     
##                           (0.175)       (0.184)       (0.369)       (0.369)      (10.762)      (10.791)      (11.928)    
##   alcohol                  0.361***      0.314***      0.330***      0.329***      0.338***      0.319***      0.339***  
##                           (0.017)       (0.016)       (0.017)       (0.017)       (0.019)       (0.019)       (0.020)    
##   volatile.acidity                      -1.384***     -1.279***     -1.283***     -1.282***     -1.161***     -1.329***  
##                                         (0.095)       (0.099)       (0.099)       (0.099)       (0.100)       (0.116)    
##   pH                                                  -0.422***     -0.412***     -0.377**      -0.289*       -0.465***  
##                                                       (0.115)       (0.116)       (0.120)       (0.119)       (0.134)    
##   free.sulfur.dioxide                                               -0.001        -0.001        -0.002        -0.002     
##                                                                     (0.002)       (0.002)       (0.002)       (0.002)    
##   density                                                                         11.839         0.004        15.173     
##                                                                                  (10.596)      (10.646)      (11.891)    
##   sulphates                                                                                      0.645***      0.668***  
##                                                                                                 (0.104)       (0.104)    
##   citric.acid                                                                                                 -0.382**   
##                                                                                                               (0.135)    
## -------------------------------------------------------------------------------------------------------------------------
##   R-squared                0.227         0.317         0.323         0.323         0.324         0.340         0.343     
##   adj. R-squared           0.226         0.316         0.321         0.321         0.321         0.337         0.340     
##   sigma                    0.710         0.668         0.665         0.665         0.665         0.658         0.656     
##   F                      468.267       370.379       253.328       190.155       152.397       136.411       118.592     
##   p                        0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood       -1721.057     -1621.814     -1615.101     -1614.724     -1614.097     -1594.978     -1590.942     
##   Deviance               805.870       711.796       705.845       705.512       704.959       688.301       684.835     
##   AIC                   3448.114      3251.628      3240.202      3241.447      3242.195      3205.956      3199.884     
##   BIC                   3464.245      3273.136      3267.087      3273.710      3279.835      3248.973      3248.279     
##   N                     1599          1599          1599          1599          1599          1599          1599         
## =========================================================================================================================
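The table above compares nested models that each add one predictor. A minimal base-R sketch of that pattern follows, using simulated stand-in data in which quality is already numeric, so no `as.numeric()` conversion is needed. The `toy` data frame, the model names `m1` to `m3`, and the `fit_stats` object are hypothetical; the original table-building call is not shown in the report.

```r
# Simulated stand-in data; values are NOT from wineQualityReds.csv.
set.seed(4)
n <- 500
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1.1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  pH               = rnorm(n, 3.31, 0.15)
)
# Quality is simulated to depend on alcohol and volatile.acidity only.
toy$quality <- 2 + 0.35 * toy$alcohol - 1.3 * toy$volatile.acidity +
  rnorm(n, sd = 0.65)

# Nested models: each one adds a predictor to the previous formula.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + pH)

# Collect R-squared, adjusted R-squared, and AIC side by side.
fit_stats <- sapply(list(m1 = m1, m2 = m2, m3 = m3), function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared, AIC = AIC(m))
})
print(round(fit_stats, 3))
```

In this toy setup pH carries no signal, so adding it can only nudge R-squared upward by a tiny amount, while adjusted R-squared and AIC show whether the extra term earns its place.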

Given our data set, the r-squared statistic, a measure of how well the model fits the actual data, is low at 34.3% for the full model (lm7). R-squared is the proportion of variance in our response / target variable (quality) that is explained by the predictor variables in the model. In a multiple regression setting, r-squared never decreases as more variables are included in the model. Therefore, adjusted r-squared is the preferred measure, as it adjusts for the number of variables considered. Our adjusted r-squared value is 34.0%. Surprisingly, alcohol alone (lm1) explains only about 23% of the variance in wine quality.
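Two of the figures in the table can be reproduced by hand from numbers already printed in this report. With a single predictor, r-squared equals the squared Pearson correlation, and adjusted r-squared follows from r-squared, the number of observations n, and the number of predictors p as 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below uses the printed (rounded) values, so small rounding differences are possible; the helper name `adjusted_r2` is illustrative.

```r
# Values copied from the correlation output and the model table above.
r_alcohol <- 0.4761663   # Pearson r between quality and alcohol
n_obs     <- 1599        # N in the model table
r2_lm7    <- 0.343       # R-squared of lm7 (as printed, 3 decimals)
p_lm7     <- 7           # number of predictors in lm7

# With a single predictor, R-squared equals the squared correlation.
r2_lm1 <- r_alcohol^2
print(round(r2_lm1, 3))

# Adjusted R-squared penalises each additional predictor.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_lm7 <- adjusted_r2(r2_lm7, n_obs, p_lm7)
print(round(adj_lm7, 3))
```

Both results agree with the table to three decimals: about 0.227 for lm1 and about 0.340 for the adjusted value of lm7.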

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The relationship between more alcohol and less volatile.acidity in better quality wine was clearly expressed. These were the two highest correlated variables with quality.

Additionally, density & volatile.acidity, sulphates & citric.acid, alcohol & density, and pH & free.sulfur.dioxide all displayed patterns for higher quality wine.

Introducing pH and free.sulfur.dioxide into the equation strengthened patterns affecting wine quality. pH was an important parameter to understand, measuring acidity (acetic) and affecting taste (vinegary). Understanding the relationship between pH and SO2 was also crucial in our multivariate analysis.

Were there any interesting or surprising interactions between features?

Wines with higher alcohol content had lower sulphate content. Higher quality wines had alcohol levels > 10.0 and sulphate levels < 0.80. It was interesting to see this in the visualization.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear regression model with the variables analyzed in the multivariate plot section: alcohol, volatile.acidity, pH, free.sulfur.dioxide, density, sulphates, and citric.acid. The model explained only 34.0% of the variance in quality (adjusted r-squared), for the reasons discussed above.

Final Plots and Summary

Plot One

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Description One

The quality of wines in our data set ranges from 3 to 8. 82.5% (1,319) of the wines are rated 5 or 6. Only 1.8% (28) of the wines are rated 3 or 8 combined. Not a single wine is ‘low’ quality at 1 or 2, nor ‘high’ quality at 9 or 10. As we determined throughout the analysis, the red wine data set lacks the full range of quality ratings and is biased toward a specific region and type of grape.
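The percentages in this description follow directly from the counts printed under Plot One. A short sketch of the arithmetic is below; the helper name `share_of` is illustrative.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)

# Hypothetical helper: combined count and percentage for a set of scores.
share_of <- function(scores) {
  k <- sum(quality_counts[as.character(scores)])
  c(count = k, percent = round(100 * k / total, 1))
}

mid  <- share_of(c(5, 6))   # mid-range wines
ends <- share_of(c(3, 8))   # lowest and highest scores combined
print(total)
print(mid)
print(ends)
```

The counts sum to 1,599 wines; scores 5 and 6 together account for 1,319 of them, and scores 3 and 8 together for 28.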

Plot Two

Volatile.acidity summary:

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Sulphates summary:

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

Description Two

Taking the three variables most highly correlated with red wine quality, I created histograms and boxplots to present slightly different visualizations from those created previously:

  • Alcohol: The distribution shown has a positive skew. As such, the mean is larger than the median. We know that higher alcohol content tends to go with better wine. Nevertheless, it would have been better to have more observations of higher and/or lower alcohol content to better inform our linear regression model.
  • Volatile.acidity: The second most influential variable in our data set related to wine quality, albeit only weakly to moderately correlated. The highest quality wines bottom out in their acetic acid content, indicating a support level for the best wines. From the summary we see that the medians of 7 and 8 quality wines are exactly the same (0.37) and their inter-quartile ranges differ by only 0.048. Furthermore, the inter-quartile ranges of neighbouring quality scores all overlap, indicating volatile.acidity alone doesn’t determine quality.
  • Sulphates: This plot illustrates the varying concentration of sulphates in our wines. As previously noted, this is an inverse relation to alcohol in higher quality wines; i.e. higher quality wines have more alcohol (> 10.0) and fewer sulphates (< 0.80). The shaded region corresponds with wines rated 5 or 6 quality.
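The inter-quartile figures for volatile.acidity can be checked against the summary printed above. This sketch copies the quartiles per quality score and tests which boxes overlap; the object and helper names are illustrative.

```r
# Volatile.acidity quartiles per quality score,
# copied from the summary output above.
va <- data.frame(
  quality = 3:8,
  q1      = c(0.6475, 0.530, 0.460, 0.3800, 0.3000, 0.3350),
  median  = c(0.8450, 0.670, 0.580, 0.4900, 0.3700, 0.3700),
  q3      = c(1.0100, 0.870, 0.670, 0.6000, 0.4850, 0.4725)
)
va$iqr <- va$q3 - va$q1

# Difference between the interquartile ranges of quality 7 and 8.
iqr_gap <- abs(va$iqr[va$quality == 7] - va$iqr[va$quality == 8])
print(iqr_gap)

# Two boxes overlap when each starts at or below the other's upper quartile.
overlaps <- function(i, j) va$q1[i] <= va$q3[j] && va$q1[j] <= va$q3[i]

# Overlap between each pair of neighbouring scores (3-4, 4-5, ...).
idx <- seq_len(nrow(va) - 1)
adjacent_overlap <- vapply(idx, function(i) overlaps(i, i + 1), logical(1))
print(adjacent_overlap)

# Overlap between the two extremes (quality 3 and quality 8).
print(overlaps(1, nrow(va)))
```

The inter-quartile ranges of quality 7 and 8 differ by 0.0475. Every pair of neighbouring scores overlaps, but the boxes for quality 3 and quality 8 do not.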

Plot Three

Description Three

Simultaneous visualization of the three most influential variables related to quality. The majority of wines, rated 5 and 6, have been removed to avoid excess clutter.

Most high quality wines tend to contain low volatile.acidity, low sulphates, and high alcohol; while the exact opposite occurs in low quality wines. There is still a lot of noise and overlap of the colors and sizes, indicating no one combination, of even these three most-correlated variables, can entirely predict good wine quality. There are a few high quality wines with lower alcohol and sulphate concentrations, however none that are heavy on volatile.acidity.

Reflection

Having completed the analysis of the red wine data set, and looking back at my initial goal of understanding red wine quality with respect to pairing it with chocolate, I believe I have a much stronger understanding of the winemaking process as a whole, and more specifically which chemical properties potentially determine wine quality. I say ‘potentially’ because we ultimately found that there isn’t a single variable, nor set of variables, that could predict wine quality with high confidence. Bearing in mind winemaking is a science and quality is objective, some interesting challenges arose.

The data set was flawed from the beginning. The sample was limited (too small, lack of extremes) and biased (grape types from a single area in Portugal). Too many wines of average quality created a lot of noise, which isn’t helpful for building a linear model to predict wine quality. The way the quality rating is calculated (the median of subjective evaluations, per the author’s notes) could help explain why most of the wines are of quality five or six.

In the univariate plot section, the objective was to understand the distribution of each variable and check for anomalies and outliers. Using appropriate plots, we learned how to quantify and visualize individual variables within a data set.

In the bivariate plot section, we explored variables to identify the most important relationships and patterns within our data set; calculating correlations and investigating conditional means.

Lastly, in the multivariate plot section, we utilized powerful methods and visualizations for examining relationships among multiple variables; such as reshaping data frames and using aesthetics like color and shape to uncover more information.

Despite the challenges, walking through these three sections of analysis provided some clarity in terms of trends and influential variables related to wine quality.

We saw that variables were generally positively skewed or normally distributed. Alcohol and volatile.acidity were the two variables most correlated with quality, the former positively (more alcohol, higher quality) and the latter negatively (less volatile.acidity, higher quality). The case of the missing citric.acid values was eventually solved by reading additional wine resources, which explain that it is added to some wines, but not all, to increase acidity; hence the 0.00 measures. Initially, I believed sugar would be a key component for measuring wine quality. It was not, but I quickly learned about fermentation and the properties of sweet vs. dry wine. I would have thought sulphates were a negative predictor of quality, but as we learned they are used as a preservative, and the plot told us they are present, but not to the detriment of good quality. Also surprising, pH had little to no correlation with quality, although we later learned about the crucial relationship between pH & SO2.

At the end of the day, and because winemaking is a science and quality is objective, we are ultimately searching for the right balance. The winemaker focuses on the chemical properties and the consumer can ‘subjectively’ choose his or her quality wine along measures of sweetness, acidity, tannin, alcohol and body.

For future analysis, the following would be helpful additions to our red wine data set:

References

Udacity - https://www.udacity.com/

Udacity Discussion Forum - https://discussions.udacity.com/c/nd002-data-analysis-with-r

Clarke, Oz, Introducing Wine: A Complete Guide for the Modern Wine Drinker,
Published November 1st 2004 by Harvest Books (first published 2000).

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at (@Elsevier): http://dx.doi.org/10.1016/j.dss.2009.05.016
Pre-press (pdf): http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
bib: http://www3.dsi.uminho.pt/pcortez/dss09.bib

Github - https://github.com/

Miscellaneous:
The Australian Wine Research Institute
Waterhouse Lab, UCDavis, University of California
MoreThanOrganic, French Natural Wine
Quench
ResearchGate
Miller, Mike. How SO2 & pH are Linked